Computer Science 1

Computer lab 09, Semester 1, 2023/24/1

Course Number: BTPS1210BA-E
Contact Information: Please contact me through email anytime:
Author

Kálmán Abari

Published

August 28, 2024

Learning outcomes
Students will learn how to run descriptive analysis in jamovi.

Jamovi - Descriptive statistics II.

Problem 1

In this section you will learn how to explore the distribution of categorical variables using graphical tools.

The comics.omv dataset has information on all comic characters that have been introduced by DC and Marvel (see Lab 07 - Problem 1).

  1. Create a bar chart for the variables align, gender, and id. Select the Analyses / Exploration / Descriptives menu. To fully investigate the distribution of categorical variables (nominal or ordinal), perform all three of the following steps:

    • display valid and missing values
    • display frequency table
    • display a bar chart.

Valid and Missing values

Frequency tables

Bar chart

Bar chart for id variable

Bar chart for align variable

Bar chart for gender variable

As you can see, the bar chart conveys the same information as the frequency table. A figure is easier to interpret at a glance, but it is less precise than the frequency table.

For example, the predominance of male characters among all the characters is very noticeable, as can be seen in the last picture. If we are interested in the exact numbers, we see that the 16421 male characters are 74% of all characters.
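
The three steps above (valid and missing values, frequency table, percentages) can also be sketched outside jamovi. The Python snippet below is only an illustration: the gender values are a made-up toy column, not the comics data, and the choice to compute percentages over the valid values is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy column; None marks a missing value (not the real comics data).
gender = ["Male", "Female", "Male", None, "Male", "Other", "Female", "Male", None, "Male"]

def frequency_table(values):
    """Return (n_valid, n_missing, rows); each row is (level, count, percent of valid)."""
    valid = [v for v in values if v is not None]
    n_valid = len(valid)
    n_missing = len(values) - n_valid
    counts = Counter(valid)
    # Assumption of this sketch: percentages are taken over the valid values only.
    rows = [(level, count, 100 * count / n_valid) for level, count in counts.most_common()]
    return n_valid, n_missing, rows

n_valid, n_missing, rows = frequency_table(gender)
print(f"Valid: {n_valid}, Missing: {n_missing}")
for level, count, pct in rows:
    print(f"{level:<8}{count:>4}{pct:>8.1f}%")
```

The bar chart draws the count column of such a table as bar heights, which is why the two displays carry the same information.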

  2. Save the file comics.omv and send it to abari.kurzus@gmail.com. The subject of the email should be Lab08 - Problem 1.

Bar chart

A bar chart is a graphical representation used to display the distribution or frequency of categorical data. It is a common and effective way to visualize and summarize the data for qualitative variables, which represent categories or groups rather than numerical values. Bar charts are particularly useful for comparing the relative frequencies or counts of different categories within a dataset.

Here are the key features and characteristics of a bar chart:

  • Categorical Data: Bar charts are used to represent data with categorical variables, such as types of fruits, colors, survey responses (e.g., strongly agree, agree, neutral, disagree, strongly disagree), or any other non-numeric categories.

  • Rectangular Bars: In a bar chart, each category is represented by a rectangular bar. The length or height of the bar is proportional to the frequency or count of that category in the dataset. In jamovi, the count is used.

  • Categories on the X-Axis: The categories are typically displayed on the horizontal X-axis, making it easy to compare different groups. Each category is labeled on the X-axis, and the bars are positioned above or adjacent to the corresponding labels.

  • Frequency on the Y-Axis: The frequency or count of each category is typically represented on the vertical Y-axis. The scale on the Y-axis depends on the range of the data, and it can be in the form of counts, percentages, or other relevant units.

  • Bar Orientation: Bar charts can be either horizontal or vertical, depending on the data and the preferred presentation style. Vertical bar charts are more common, but horizontal bar charts are used when the category labels are long or when you want to emphasize differences in length more easily.

Bar charts are a straightforward and effective way to visualize categorical data, making it easy to compare the relative frequencies or counts of different categories within a dataset. They are commonly used for data presentation in reports, presentations, and publications, and they help provide a clear, visual summary of the distribution of categorical data.

Problem 2

In this section, you will learn how to explore the relationship between two categorical variables.

The comics.omv dataset has information on all comic characters that have been introduced by DC and Marvel (see Lab 07 - Problem 1).

A common way to represent the number of cases that fall into each combination of levels of two categorical variables (align and id) is with a contingency table.

  1. Select Analyses / Exploration / Descriptives menu. Move the align variable to the ‘Variables’ box and the id variable to the ‘Split by’ box.

Contingency table

The output tells us that the most common category, at a count of 4493, was bad characters with secret identities. While tables of counts can be useful, you can get the bigger picture by translating these counts into a graphic.
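
To make the idea of "counts per combination of levels" concrete, here is a minimal Python sketch that builds a contingency table from raw pairs. The eight (align, id) pairs are hypothetical and the helper name contingency_table is invented for this illustration; jamovi does this for you through the menus.

```python
from collections import Counter

# Hypothetical (align, id) pairs for a handful of characters, not the real data.
characters = [
    ("Bad", "Secret"), ("Bad", "Secret"), ("Bad", "Public"),
    ("Good", "Secret"), ("Good", "Public"), ("Good", "Public"),
    ("Neutral", "Public"), ("Bad", "Secret"),
]

def contingency_table(pairs):
    """Count how many cases fall into each (row level, column level) combination."""
    counts = Counter(pairs)
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    # Combinations that never occur get a count of 0.
    table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}
    return table, row_levels, col_levels

table, row_levels, col_levels = contingency_table(characters)

# Find the most common combination, as done by eye in the jamovi output.
largest_cell = max(
    ((r, c) for r in row_levels for c in col_levels),
    key=lambda cell: table[cell[0]][cell[1]],
)
print(table)
print("Largest cell:", largest_cell)
```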

  2. We are interested in the relationship between two categorical variables, which a bar chart represents well.

Bar chart for a contingency table

The X-axis of the bar chart above shows the alignment of the characters, and the colouring shows identity. Since we have two categorical variables, the data are laid out along two dimensions (X-axis and colour). The figure also makes it easy to see that the largest group is bad characters with secret identities.

  3. jamovi offers an even better way to produce a contingency table than the one used above. Select the Analyses / Frequencies / Independent Samples menu. Move the align variable to the ‘Rows’ field and the id variable to the ‘Columns’ field. Activate the ‘Rows’ check box on the ‘Cells’ tab and the ‘Bar Plot’ check box on the ‘Plots’ tab.

Contingency table

Bar chart for a contingency table

This contingency table shows a little more information, in a different arrangement, than the one created earlier. It again shows that the largest group is bad characters with secret identities. Thanks to the row percentages, we can also see that secret identities predominate among the bad characters (63% of the bad characters have a secret identity), whereas for the other alignments public and secret identities are roughly balanced (Good: 48%-41%, Neutral: 42%-41%, Reformed Criminals: 50%-50%).

  4. Save the file comics.omv and send it to abari.kurzus@gmail.com. The subject of the email should be Lab08 - Problem 2.

Contingency table

A contingency table, also known as a cross-tabulation or a two-way table, is a tabular representation used in descriptive statistics to summarize and display the relationships between two categorical variables. It provides a way to examine the distribution of one variable with respect to another or to assess the association, dependence, or independence between the variables. Contingency tables are especially useful when you want to analyze and visualize the relationship between two categorical variables and see how they are related or how they co-occur.

Key characteristics of a contingency table include:

  • Rows and Columns: Contingency tables have rows and columns, each corresponding to different categories or levels of the two categorical variables being studied.

  • Counts or Frequencies: The cells of the table contain counts or frequencies that represent the number of observations falling into each combination of categories from the two variables. The intersection of a row and a column represents a specific category combination.

  • Marginal Totals: Contingency tables often include marginal totals, which are the sums of counts for each row and each column. These totals provide information about the overall distribution of each variable.

  • Row and column percentages in a contingency table are ways to express the relative proportions and relationships between the categories of the variables being examined. They help to provide a more detailed and intuitive understanding of the distribution of data in the table. Row percentages and column percentages are calculated based on the counts or frequencies in the contingency table, and they are used to assess the relationships between the two categorical variables being studied.

    • Row percentages, also known as conditional percentages, are calculated within each row of the contingency table. They express the count in a specific cell relative to the total number of cases in that cell's row. The formula for row percentages is: (Count in a Specific Cell) / (Total Count in the Row) * 100%. Row percentages show how the column variable is distributed within each category of the row variable.

    • Column percentages are calculated within each column of the contingency table. They express the count in a specific cell relative to the total number of cases in that cell's column. The formula for column percentages is: (Count in a Specific Cell) / (Total Count in the Column) * 100%. Column percentages show how the row variable is distributed within each category of the column variable.
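
The two formulas can be tried out on a small table. The Python sketch below uses a hypothetical 2x2 table of counts (the numbers are invented and are not the comics results), and the function names are chosen only for this illustration.

```python
# Hypothetical table of counts (rows: align, columns: id); not the real data.
counts = {
    "Bad":  {"Public": 20, "Secret": 60},
    "Good": {"Public": 30, "Secret": 40},
}

def row_percentages(table):
    """Each cell divided by its row total, times 100."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result

def column_percentages(table):
    """Each cell divided by its column total, times 100."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

row_pct = row_percentages(counts)
col_pct = column_percentages(counts)
print("Row % for Bad:", row_pct["Bad"])
print("Column % for Secret:", {row: col_pct[row]["Secret"] for row in col_pct})
```

In this toy table, 75% of the bad characters have a secret identity (a row percentage), while 60% of the secret identities belong to bad characters (a column percentage). The same cell answers two different questions depending on which total it is divided by.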

Problem 3

In this section, you will learn how to explore numerical data using graphical tools.

In this problem, we’ll broaden our toolbox of exploratory techniques to encompass numerical data. Numerical data take the form of numbers, where those numbers actually represent a value on the number line. The dataset that we’ll be working with has information on the cars that were for sale in the US in a certain year.

  1. Open this link and download the Cars.csv from there (or from here).

  2. You can study the description of the database in the link above. Based on this, let’s set the variables.

Review the variables

  3. Note that we have 428 observations, or cases, and 19 variables.

Number of rows (number of observations)

Number of columns (number of variables)
  4. Learn more about each of the variables. Determine the number of valid and missing values for all variables, choosing a space-saving layout.

Valid and missing values

Move all variables to the ‘Variables’ box. To save space, select the ‘Variables across rows’ item in the Descriptives list.

The ‘Variables across rows’ list item

Valid and missing values
  5. The most direct way to represent numerical data is a dot plot, where each case is a dot placed at its appropriate value on the X-axis and stacked as other cases take similar values. This form of graphic has zero information loss; you could rebuild the dataset perfectly if you were given this plot. As you can imagine, though, these plots become difficult to read as the number of cases gets very large.

     Let’s display a dot plot of the number of cylinders (ncyl), horsepower (horsepwr) and city mileage (city_mpg) of the cars. Select the Analyses / Exploration / Descriptives menu. Move the ncyl, horsepwr, and city_mpg variables to the ‘Variables’ box. On the ‘Plots’ tab, select the ‘Data’ checkbox and select the ‘Stacked’ item.

Display dot plot

Dot plot for ncyl

Dot plot for city_mpg

Dot plot for horsepwr

As we can see, the dot plot in jamovi is not perfect: when a value occurs very often, its stack of dots is clipped at the plot margin, so its frequency cannot be read off. Nevertheless, the plots tell us that, for example, 4, 6 and 8 are the most common numbers of cylinders, and that most cars have a city mileage of around 20 and a horsepower of around 200.

Dot plot

A dot plot, also known as a dot chart or dot graph, is a simple and straightforward graphical representation used in descriptive statistics to display the distribution of a numerical variable. It is particularly useful for visualizing the distribution of numerical data, showing individual data points as dots along a single axis. Each dot represents a single data point, and the arrangement of the dots provides a clear view of the data’s distribution.

Key features of a dot plot include:

  • Individual Data Points: In a dot plot, each data point is represented by a dot placed at the appropriate position along a numerical axis. The vertical or horizontal axis can represent the variable being measured.

  • Stacking Dots: When multiple data points have the same value, the dots are stacked vertically on top of one another at the corresponding value on the axis. This stacking allows you to see the frequency of data points at each value.

  • Symmetry and Clustering: Dot plots can reveal patterns in the data, such as symmetry, skewness, gaps, clusters, or outliers. The arrangement of dots provides insights into the data’s central tendency and spread.

  • Simplicity: Dot plots are relatively simple to create and interpret, making them a useful tool for quick data exploration and comparison.
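
The stacking rule can be illustrated with a few lines of Python. This is a rough text-based sketch, not how jamovi draws its plot; the ncyl values are hypothetical and the helper names are invented for the example.

```python
from collections import Counter

# Hypothetical cylinder counts for ten cars (not the real Cars.csv values).
ncyl = [4, 6, 4, 8, 6, 4, 4, 6, 8, 4]

def stack_heights(values):
    """Map each distinct value to the number of dots stacked at it."""
    return dict(sorted(Counter(values).items()))

def text_dot_plot(values):
    """Draw a text dot plot: one 'o' per case, stacked above its value."""
    heights = stack_heights(values)
    tallest = max(heights.values())
    lines = []
    for level in range(tallest, 0, -1):
        lines.append(" ".join("o" if h >= level else " " for h in heights.values()))
    lines.append(" ".join(str(v) for v in heights))  # axis labels (single digits here)
    return "\n".join(lines)

heights = stack_heights(ncyl)
print(text_dot_plot(ncyl))

# No information is lost: the sorted data can be rebuilt from the stack heights.
rebuilt = [value for value, height in heights.items() for _ in range(height)]
```

Because every case keeps its own dot, the rebuilt list equals the sorted original data, which is what zero information loss means in practice.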

  6. One of the most common plots is the histogram, which maps the height of each bar to the number of cases that fall into its bin. Because of the binning, it is not possible to perfectly reconstruct the dataset; what we gain is a bigger picture of the shape of the distribution. If the stepwise nature of the histogram irks you, then you’ll like the density plot, which represents the shape of the histogram using a smooth line. This provides an even bigger-picture representation of the shape of the distribution, so you’ll only want to use it when you have a large number of cases. The values of a variable can also be used to construct a box plot, where the box represents the central bulk of the data, the whiskers contain almost all the data, and the extreme values are represented as points.

We will investigate the distribution of mileage (city_mpg) across a categorical variable. (The higher the city MPG, the more miles you can travel on the same amount of fuel, resulting in better overall mileage for your vehicle.) First, plot a histogram of city_mpg faceted by suv, a logical variable indicating whether the car is an SUV or not. Select Analyses / Exploration / Descriptives menu. Move the city_mpg variable to the ‘Variables’ box and the suv variable to the ‘Split by’ box.

Histogram for mileage (city_mpg)

If we compare the centres of the two distributions, we can see that SUVs have slightly worse mileage.

  7. Replace the histogram used in the previous step with a density plot and a box plot. Select Analyses / Exploration / Descriptives once for the density plot and once more for the box plot.

Density plot for mileage (city_mpg)

Boxplot for mileage (city_mpg)

  8. The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this step you’ll try your hand at two alternatives: the box plot and the density plot. We have already looked at the number of cylinders (ncyl); now let’s see which values it takes and how often each occurs. To do this, first convert it to an ordinal variable and create a frequency table for it.

Convert ncyl to ordinal

Frequency table for ncyl

A quick look at the frequency table shows that there are more possible levels of ncyl than you might think. Here, restrict your attention to the most common levels.

  9. Create a new variable ncyl_468 by transforming ncyl so that it contains only the values 4, 6, and 8.

Transform a variable

Transformation
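
The logic of this recoding can be sketched in Python as follows. This is not the jamovi transform itself: the ncyl values are hypothetical, and turning every other level into a missing value (None) is an assumption of the sketch.

```python
# Hypothetical ncyl values, including a few uncommon levels (not the real data).
ncyl = [4, 6, 8, 4, 12, 5, 6, 3, 8, 10]

COMMON_LEVELS = {4, 6, 8}

def keep_common(value, common=COMMON_LEVELS):
    """Return the value if it is one of the common levels, else None (missing)."""
    return value if value in common else None

ncyl_468 = [keep_common(v) for v in ncyl]
print(ncyl_468)
```

Cases with an uncommon number of cylinders stay in the dataset but no longer belong to any group, so plots split by ncyl_468 show only the 4, 6, and 8 cylinder cars.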

Graphical tools for a numeric variable split by common cylinder counts

Histogram and density plot of city_mpg separated out by ncyl_468

Side-by-side box plots of city_mpg separated out by ncyl_468

From the figures above we can say:

  • The highest mileage cars have 4 cylinders.
  • The typical 4-cylinder car gets better mileage than the typical 6-cylinder car, which gets better mileage than the typical 8-cylinder car.
  • Most of the 4-cylinder cars get better mileage than even the most efficient 8-cylinder cars.
  • The variability in mileage of 8-cylinder cars is smaller than the variability in mileage of 4-cylinder cars.

  10. Save the file Cars.omv and send it to abari.kurzus@gmail.com. The subject of the email should be Lab08 - Problem 3.

Histogram

A histogram is a graphical representation used in descriptive statistics to visualize the distribution of a variable, particularly for numerical data. It provides a way to display the frequency or count of data points falling within specific intervals or “bins.” Histograms are useful for understanding the shape, central tendency, and spread of a dataset, as well as identifying patterns and outliers.

Key characteristics of a histogram include:

  • Bins or Intervals: The range of the data is divided into intervals, or bins, along the horizontal axis (X-axis). Each bin represents a specific range of values.

  • Frequency or Count: The vertical axis (Y-axis) displays the frequency or count of data points that fall into each bin. The height of each bar in the histogram represents how many data points are within the corresponding interval.

  • Bar Width: The width of the bars in a histogram can vary depending on the choice of bin width. The choice of bin width can affect the appearance and interpretation of the histogram.

  • Continuous Data: Histograms are typically used for continuous numerical data. For discrete data, a similar visualization called a bar chart may be used.

  • Sum of Bar Heights: When the Y-axis shows counts, the sum of the heights of all bars in a histogram equals the total number of data points in the dataset.
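
The binning step behind a histogram can be sketched in a few lines of Python. The city_mpg values, the bin start, and the bin width below are all hypothetical choices made for this illustration; jamovi picks its own bins.

```python
# Hypothetical city mileage values (miles per gallon); not the real Cars.csv data.
city_mpg = [14, 16, 17, 18, 18, 19, 20, 20, 21, 22, 24, 26, 29, 35]

def histogram_counts(values, start, width, n_bins):
    """Count the values in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the binned range are ignored
            counts[index] += 1
    return counts

counts = histogram_counts(city_mpg, start=10, width=5, n_bins=6)
for i, c in enumerate(counts):
    low = 10 + 5 * i
    print(f"[{low:>2}, {low + 5:>2}) {'#' * c}")
```

Only the count per bin survives, so the individual values inside a bin can no longer be recovered. Changing the width or the start of the bins changes the counts and therefore the look of the histogram.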

Histograms help to reveal various characteristics of the data distribution:

  • Shape: The shape of the histogram can provide insights into whether the data is normally distributed, skewed, bimodal (having two peaks), or exhibits other patterns.

  • Central Tendency: The central tendency of the data can be identified by looking at the peak or mode of the histogram.

  • Spread: The spread or variability of the data can be observed by examining the width and dispersion of the bars in the histogram.

  • Outliers: Outliers, which are data points significantly different from the majority, can be identified as data points located far from the bulk of the data in the histogram.

Histograms are widely used in data analysis, research, and data visualization to better understand the characteristics of a dataset. They are a valuable tool for summarizing and communicating the distribution of numerical data, making it easier to draw conclusions and make data-driven decisions.

Density plot

A density plot, also known as a kernel density plot, is a graphical representation used in descriptive statistics to visualize the distribution of a variable, particularly for numerical data. It provides a smooth and continuous estimation of the probability density function (PDF) of the data, allowing for a more detailed and nuanced view of the data distribution. Density plots are especially useful when you want to understand the shape, central tendency, and spread of a dataset and when you want to see the underlying probability distribution.

Key features of a density plot include:

  • Smooth Curves: Density plots are typically displayed as smooth, continuous curves, rather than discrete bars as seen in histograms.

  • Continuous Estimation: They estimate the PDF of the data using mathematical functions, such as Gaussian kernels. This results in a continuous representation of the data distribution.

  • No Bins: Unlike histograms, density plots do not rely on predefined bins or intervals. The curves are created through mathematical transformations of the data.

  • Area Under the Curve: The total area under the density curve equals 1, representing the entire probability distribution. The area under the curve over an interval therefore estimates the proportion of the data falling in that interval.

  • Bandwidth: The bandwidth parameter in density estimation affects the smoothness and sensitivity of the density plot. A smaller bandwidth results in a more detailed but noisy plot, while a larger bandwidth results in a smoother but less detailed plot.
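
The features above can be checked with a small kernel density estimate written in plain Python. This is a minimal sketch assuming a Gaussian kernel; the sample and the two bandwidth values are hypothetical, and jamovi's own density plot may use different settings.

```python
import math

# Hypothetical sample; the bandwidths used below are arbitrary illustrative choices.
sample = [14, 16, 17, 18, 18, 19, 20, 20, 21, 22, 24, 26, 29, 35]

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

def area_under(density, low, high, steps=2000):
    """Approximate the area under the curve with the trapezoidal rule."""
    step = (high - low) / steps
    total = 0.5 * (density(low) + density(high))
    total += sum(density(low + i * step) for i in range(1, steps))
    return total * step

narrow = gaussian_kde(sample, bandwidth=1.0)  # more detail, but a bumpier curve
wide = gaussian_kde(sample, bandwidth=4.0)    # smoother curve, less detail

area_narrow = area_under(narrow, -20, 70)
area_wide = area_under(wide, -20, 70)
print(round(area_narrow, 3), round(area_wide, 3))
```

Both curves enclose an area of approximately 1 regardless of the bandwidth; the bandwidth only changes how that area is spread out.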

Density plots are used to reveal various characteristics of the data distribution:

  • Shape: The shape of the density plot can provide insights into whether the data is normally distributed, skewed, bimodal (having two peaks), or exhibits other patterns. The peaks and valleys of the curve indicate modes in the data.

  • Central Tendency: The highest point (mode) of the density plot represents the central tendency of the data.

  • Spread: The width of the curve at different points reflects the spread or variability of the data. Wider sections indicate higher variability.

  • Outliers: Outliers, which are data points significantly different from the majority, can appear as small, isolated bumps in the tails of the density plot.

Density plots are valuable for summarizing the distribution of numerical data and are especially useful for comparing the distributions of multiple groups or variables. They are widely used in data analysis, research, and data visualization to gain a deeper understanding of data distribution.

Box plot

A box plot, also known as a box-and-whisker plot, is a graphical representation used in descriptive statistics to display the distribution of a variable, particularly for numerical data. It provides a visual summary of the central tendency, spread, and potential outliers within the data. Box plots are useful for quickly assessing the shape and variability of the data and for comparing distributions across different groups or variables.

Key characteristics of a box plot include:

  • Box: The central rectangular box in the plot represents the interquartile range (IQR), which encompasses the middle 50% of the data. The lower and upper boundaries of the box correspond to the first quartile (Q1) and the third quartile (Q3), respectively, so the length of the box equals the IQR (Q3 - Q1).

  • Whiskers: Lines (whiskers) extend from the edges of the box. They typically reach out to the smallest and largest values that lie within 1.5 times the IQR of the box edges, that is, between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. Any data points beyond this range are considered potential outliers and are plotted individually as dots or asterisks.

  • Median Line: A line inside the box represents the median (Q2), which is the middle value of the dataset when it is ordered.

  • Symmetry and Skew: The shape and length of the whiskers can reveal information about the symmetry or skewness of the data distribution.

  • Outliers: Outliers, which are data points that fall outside the whiskers, are displayed individually to help identify extreme values.
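
The numbers behind a box plot can be computed with Python's standard statistics module. The sketch below uses hypothetical mileage values with one extreme case; note that software packages differ slightly in how they compute quartiles, so the 'inclusive' method chosen here is an assumption and jamovi's values may differ a little.

```python
import statistics

# Hypothetical mileage values with one extreme case; not the real Cars.csv data.
city_mpg = [14, 16, 17, 18, 18, 19, 20, 20, 21, 22, 24, 26, 60]

def box_plot_summary(values):
    """Compute the quartiles, the 1.5 * IQR fences, the whisker ends, and the outliers."""
    data = sorted(values)
    # Quartile conventions vary between programs; 'inclusive' is one common choice.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

summary = box_plot_summary(city_mpg)
print(summary)
```

The whiskers end at the most extreme data values that are still inside the fences, not at the fences themselves. Here the value 60 lies beyond the upper fence, so it would be drawn as a separate point.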

Box plots are used to address various aspects of the data distribution:

  • Central Tendency: The median line shows the central tendency of the data. If the distribution is symmetric, the median line sits near the middle of the box. If the distribution is skewed, the median line lies closer to one end of the box.

  • Spread: The length of the whiskers and the length of the box indicate the spread and variability of the data. A short box with short whiskers suggests a concentrated distribution, while a long box or long whiskers indicate greater dispersion.

  • Outliers: Box plots make it easy to identify and visualize potential outliers within the data.

  • Comparison: Box plots are particularly useful for comparing the distributions of multiple groups, categories, or variables in a single plot, making them a valuable tool for data exploration and visualization.

Box plots provide a concise and informative summary of numerical data, making them a popular choice for exploratory data analysis, research, and data visualization in various fields, including statistics, business, and science.

Problem 4

In this section, you will learn how to apply your knowledge of descriptive statistics to categorical and numerical variables.

The survey.omv dataset has information on a survey of over 200 students at the University of Adelaide (see Lab 07 - Problem 2).

  1. Formulate 4 questions about the survey dataset. Answer them using the methods from Problems 1-3 of this lab.

  2. Summarise the questions, the written answers, and the jamovi screenshots in a file named survey_2.html.

  3. Send your survey_2.html and survey_2.omv files to abari.kurzus@gmail.com. The subject of the email should be Lab08 - Problem 4. The better the solution, the more badges you get (from zero to four).